Abstract

We information-theoretically reformulate two measures of capacity from statistical
learning theory: empirical VC-entropy and empirical Rademacher complexity.
We show these capacity measures count the number of hypotheses about a dataset
that a learning algorithm falsifies when it finds the classifier in its repertoire
minimizing empirical risk. It then follows that the future performance of
predictors on unseen data is controlled in part by how many hypotheses the learner
falsifies. As a corollary we show that empirical VC-entropy quantifies the
message length of the true hypothesis in the optimal code of a particular probability
distribution, the so-called actual repertoire.



1 Introduction 

This note relates the number of hypotheses falsified by a learning algorithm to the expected future 
performance of the predictor it outputs. It does so by reformulating two basic results from statistical 
learning theory information-theoretically. 

Suppose we wish to predict an unknown physical process $\sigma^* : \mathcal{X} \to \mathcal{Y}$ occurring in nature after
observing its outputs $(y_1, \ldots, y_l)$ on sample $\mathcal{D} = (x_1, \ldots, x_l)$ of its inputs, where inputs arise
according to unknown distribution $P$. One method is to take a repertoire $\mathcal{F}$ of functions from
$\mathcal{X}$ to $\mathcal{Y}$ and choose the predictor $f \in \mathcal{F}$ that best approximates $\sigma^*$ on the observed data. How
confident can we be in $f$'s future performance on unseen data?

Statistical learning theory provides bounds on $f$'s expected future performance by quantifying a
tradeoff implicit in the choice of repertoire $\mathcal{F}$. At first glance, the bigger the repertoire the better,
since the best approximation to $\sigma^*$ in $\mathcal{F}$ can only improve as more functions are added to $\mathcal{F}$.
However, increasing $\mathcal{F}$, and improving the approximation on observed data, can reduce future
performance due to overfitting. As a result, the bounds depend on both the accuracy with which $f$
approximates $\sigma^*$ on the observed data and the capacity of repertoire $\mathcal{F}$, see Theorems 9 and 10.

We wish to connect statistical learning theory with Popper's ideas about falsification. Popper argued
that no amount of positive evidence confirms a theory [11]. Rather, theories should be judged on the
basis of how many hypotheses they falsify. A theory is falsifiable if there are possible hypotheses 
about the world (i.e. data) that are not consistent with the theory. A bold theory falsifies (disagrees 
with) many potential hypotheses about observed data. Testing a bold theory, by checking that the 
hypotheses it disagrees with are in fact false, provides corroborating evidence. If a theory has been 
thoroughly tested then (perhaps) we can have confidence in its predictions. Popper's criticism of 
positive confirmation was devastating. However, and hence the "perhaps", he failed to provide a 
rationale for trusting the predictions of severely tested theories. 

To understand how falsifying hypotheses affects future performance we reformulate learning as a
kind of measurement. Before doing so, we need to describe precisely what we mean by measurement.
Given physical system $X$ with state space $S(X)$, a classical measurement is a function
$f : S(X) \to \mathbb{R}$. For example, a thermometer $f$ maps configurations (positions and momenta) of particles in the
atmosphere to real numbers. When the thermometer outputs 15°C it generates information by
specifying that atmospheric particles were in a configuration in $f^{-1}(15) \subset S(X)$. The information
generated by the thermometer is a brute physical fact depending on how the thermometer is built
and its output. We quantify the information by comparing the size of the total configuration
space $S(X)$ with the size of the pre-image $f^{-1}(15)$. The smaller the pre-image, the more
informative the measurement, see §2 for details.

More generally, any (classical) physical process $f : \mathcal{X} \to \mathcal{Y}$ can be thought of as performing
measurements by taking inputs in $\mathcal{X}$ to outputs in $\mathcal{Y}$. §4 introduces an important example,
the min-risk $R_{\mathcal{F},\mathcal{D}} : \Sigma(\mathcal{X},\mathcal{Y}) \to \mathbb{R}$, which outputs the minimum value of the empirical risk over
repertoire $\mathcal{F}$ on a hypothesis space $\Sigma(\mathcal{X},\mathcal{Y})$. Finding the min-risk is a necessary step in finding the
best approximation $f$ to $\sigma^*$ in $\mathcal{F}$. Since computing the min-risk requires actually implementing it
as a physical process somehow or other, the measurements it performs and the effective information
it generates are brute physical facts, no different in kind than the information generated by a
thermometer.

It turns out that the min-risk categorizes hypotheses in $\Sigma$ according to how well they are
approximated by predictors in repertoire $\mathcal{F}$. Proposition 12 shows that the effective information generated
by the min-risk is (essentially) the empirical VC-entropy. Moreover, the effective information
generated by the min-risk "counts" the number of hypotheses about $\mathcal{D}$ that $\mathcal{F}$ falsifies, see Eq. (13). As
a consequence, Corollary 13, we obtain that the future performance of predictor $f$ is controlled by
(i) how well $f$ fits the observed data; (ii) how many hypotheses about the data the min-risk rules out;
and (iii) a confidence term.

It follows that, provided the assumptions of the theorems below hold, bounds on future performance
are brute physical facts resulting from the act of minimizing empirical risk, and so falsifying
potential hypotheses, on observed data.

A consequence of our results, Corollary 15, is that empirical VC-entropy is essentially the minimal
length of the true hypothesis under the optimal code for the actual repertoire (a distribution depend- 
ing on the min-risk). This suggests there may be interesting connections between VC-theory and the 
minimum message length (MML) approach to induction proposed by Wallace and Boulton [15, 16]. 

Finally, §4.2 reformulates empirical Rademacher complexity via falsification. Here we build
on Solomonoff's probability distribution introduced in [12]. In short, we take Solomonoff's defi- 
nition and substitute the min-risk in place of the universal Turing machine, thereby obtaining what 
we refer to as the Rademacher distribution - a non-universal analog of Solomonoff's distribution. 
Rademacher complexity is then computed using the expectation of the min-risk over the Rademacher 
distribution, see Proposition 17. 

The min-risk thus provides a bridge that not only connects VC-theory to a computable analog of 
Solomonoff's seminal distribution, but also sheds light on how falsification provides guarantees on 
future performance. 

Related work. The connection between Popper's ideas on falsifiability and statistical learning
theory was pointed out in [5, 7, 14]. However, these works focus on VC-dimension, which does not
relate to falsification as directly as VC-entropy and Rademacher complexity, which we consider
here. Further, VC-entropy is a more fundamental concept in statistical learning theory than
VC-dimension since VC-dimension is defined in terms of the limit behavior of the growth function,
which is an upper bound on VC-entropy [14]. For more details on the link between MML and
algorithmic probability, see [17].

Acknowledgements. I thank David Dowe and Samory Kpotufe for useful comments on an earlier
version of this paper. 

2 Measurement

We consider a toy universe containing probabilistic mechanisms (input/output devices) of the
following form:

Definition 1 Given finite sets $\mathcal{X}$ and $\mathcal{Y}$, a mechanism is a Markov matrix $m$ defined by conditional
probability distribution $p_m(y|x)$.

Mechanisms generate information about their inputs by assigning them to outputs [1, 2]. 

Definition 2 The actual repertoire (or measurement) specified by $m$ outputting $y$ is the probability
distribution

$$p_m(x|y) := \frac{p_m(y|x)}{p_m(y)} \cdot p_{unif}(x),$$

where $p_{unif}(x) = \frac{1}{|\mathcal{X}|}$ is the uniform distribution. The effective information generated by the
measurement is

$$ei(m, y) := H\big[\, p_m(X|y) \,\big\|\, p_{unif}(X) \,\big],$$

where $H[p \| q] = \sum_i p_i \log_2 \frac{p_i}{q_i}$ is the Kullback-Leibler divergence.

The Kullback-Leibler divergence $H[p \| q]$ can be interpreted informally as the number of Y/N
questions needed to get from distribution $q$ to distribution $p$. However, as pointed out in [6],
Kullback-Leibler divergence is invariant with respect to the "framing of the problem" - the ordering and
structure of the questions - suggesting it is a suitable measure of information-theoretic "effort".
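
Definition 2 can be made concrete with a small numerical sketch. The Python below is illustrative only: the function names and the two toy Markov matrices are assumptions introduced here, not taken from the paper. It computes the actual repertoire by Bayes' rule under a uniform prior and then the effective information as a Kullback-Leibler divergence in bits.

```python
import math

def actual_repertoire(p_y_given_x, y):
    """Posterior over inputs given output y, assuming a uniform prior on inputs.

    p_y_given_x[x][y] is the Markov matrix of the mechanism.
    """
    n = len(p_y_given_x)
    joint = [row[y] / n for row in p_y_given_x]   # p_m(y|x) * p_unif(x)
    p_y = sum(joint)                              # normalizer p_m(y)
    if p_y == 0:
        raise ValueError("the mechanism never outputs y")
    return [j / p_y for j in joint]

def effective_information(p_y_given_x, y):
    """KL divergence (bits) of the actual repertoire from the uniform prior."""
    n = len(p_y_given_x)
    posterior = actual_repertoire(p_y_given_x, y)
    # H[p || unif] = sum_x p(x) * log2(p(x) / (1/n))
    return sum(p * math.log2(p * n) for p in posterior if p > 0)

# Hypothetical 4-input, 2-output mechanisms (illustrative numbers only).
deterministic = [[1, 0], [1, 0], [0, 1], [1, 0]]
noisy = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]

ei_det = effective_information(deterministic, 1)   # single input in the pre-image
ei_noisy = effective_information(noisy, 1)         # noise blurs the pre-image
```

In this toy case the deterministic mechanism pins down one of four inputs and so generates 2 bits, while the noisy variant generates strictly less.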

The definition of measurement is motivated by the special case where $p_m$ assigns probabilities that
are either 0 or 1; in other words, when it corresponds to a set-valued function $f : \mathcal{X} \to \mathcal{Y}$. The
measurement performed by $f$ is

$$p_f(x|y) = \begin{cases} \frac{1}{|f^{-1}(y)|} & \text{if } f(x) = y \\ 0 & \text{else,} \end{cases}$$

where $|\cdot|$ denotes cardinality. The support of $p_f(X|y)$ is the preimage $f^{-1}(y) \subset \mathcal{X}$. All elements
of the support are assigned equal probability - they are treated as an undifferentiated list. The
measurement $p_m(X|y)$ therefore generalizes the notion of preimage to the probabilistic setting.

The effective information generated by $f$ outputting $y$ is

$$\begin{aligned}
ei(f, y) &= \log_2 \frac{|\mathcal{X}|}{|f^{-1}(y)|} = \log_2 |\mathcal{X}| - \log_2 |f^{-1}(y)| \\
&= \{\text{no. potential inputs}\} - \{\text{no. inputs in pre-image}\} \\
&= \{\text{no. inputs ruled out}\},
\end{aligned} \tag{1}$$

where inputs are counted in bits (after taking logarithms). Effective information is maximal
($\log_2 |\mathcal{X}|$ bits) when a single input leads to $y$, and is minimal (0 bits) when all inputs lead to $y$. In the first
case, observing $f$ output $y$ tells us exactly what the input was, and in the latter case, it tells us nothing
at all.
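
Eq. (1) is easy to check by direct counting. The following Python sketch assumes a deterministic function and a uniform prior on inputs; the function name and the `device` example (144 inputs, of which 9 map to one output) are hypothetical choices made here to mirror the numbers in the caption of Figure 1.

```python
import math

def effective_information(f, inputs, y):
    """Bits generated when deterministic f outputs y (uniform prior on inputs).

    Implements ei(f, y) = log2 |X| - log2 |f^{-1}(y)|.
    """
    inputs = list(inputs)
    preimage_size = sum(1 for x in inputs if f(x) == y)
    if preimage_size == 0:
        raise ValueError("f never outputs y on these inputs")
    return math.log2(len(inputs)) - math.log2(preimage_size)

# Hypothetical device: 144 inputs, three outputs, "dark" on exactly 9 inputs.
def device(x):
    if x < 9:
        return "dark"
    return "mid" if x < 60 else "light"

ei_dark = effective_information(device, range(144), "dark")
```

The two extreme cases described above fall out of the same function: a constant function yields 0 bits, and an injective function yields $\log_2 |\mathcal{X}|$ bits.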



2.1 Semantics 

Next we consider two approaches to characterizing the meaning of measurements. The first relates to 
possible world semantics [9]. Here, the meaning of a sentence is given by the set of possible worlds 
in which it is true. Meaning is thus determined by considering all counterfactuals. For example, the 
meaning of "That car is 10 years old" is the set of possible worlds where the speaker is pointing to 
a car manufactured 10 years previously. Since the set contains cars of many different colors, we
see that color is irrelevant to the meaning of the sentence. 

More precisely, the meaning of sentence $\mathcal{S}$ is a map from possible worlds $W$ to truth values
$v_{\mathcal{S}} : W \to \{0, 1\}$. Equivalently, the meaning of a sentence is

$$\{\text{possible worlds}\} \supset \{\text{worlds where } \mathcal{S} \text{ is true}\}.$$
Inspired by possible world semantics, we propose 

Figure 1: The effective information generated by measurements. (A) A deterministic
device can receive 144 inputs and produce 3 outputs. (B) Each input is implicitly assigned to a
category (shaded areas). The information generated by the dark gray output is $\log_2 144 - \log_2 9 = 4$
bits.



Definition 3 The meaning of output $y$ by mechanism $m$ is

$$\underbrace{p_{unif}(X)}_{\text{possible inputs}} \to \underbrace{p_m(X|y)}_{\text{inputs that cause } y}.$$

For a deterministic function this reduces to $\mathcal{X} \supset f^{-1}(y)$.

Grounding meanings in mechanisms yields four advantages over the possible worlds approach. First, 
it replaces the difficult to define notion of a possible world with the concrete set of inputs the mech- 
anism is physically capable of receiving. Second, in possible world semantics the work of deter- 
mining whether or not a sentence is true is performed somewhat mysteriously offstage, whereas 
the meaning of a measurement is determined via Bayes' rule. Third, the approach generalizes to 
probabilistic mechanisms. Finally, we can compute the effective information generated by a mea- 
surement, whereas there is no way to quantify the information content of a sentence in possible 
world semantics. 



2.2 Risk 



The second, pragmatic notion of meaning characterizes usefulness. We consider a special case, well 
studied in statistical learning theory, where usefulness relates to predictions [14]. 

Let $\Sigma(\mathcal{X},\mathcal{Y}) = \{\sigma : \mathcal{X} \to \mathcal{Y}\}$ be the set of all functions (deterministic mechanisms) mapping $\mathcal{X}$
to $\mathcal{Y} = \{-1, +1\}$. We will often write $\Sigma$ for short. Suppose there is a random variable $X$ taking
values in $\mathcal{X}$ with unknown distribution $P$ and an unknown mechanism $\sigma^* \in \Sigma$, the supervisor, which
assigns labels to elements of $\mathcal{X}$.

Definition 4 The risk quantifies how well mechanism $f$ approximates an unknown or partially
known mechanism $\sigma^*$:

$$R(f) = \sum_{x \in \mathcal{X}} \mathbb{1}[f(x) \neq \sigma^*(x)] \cdot p(x). \tag{4}$$

It is the probability that $f$ and $\sigma^*$ disagree on elements of $\mathcal{X}$.
Unfortunately, the risk cannot be computed since $P$ and $\sigma^*$ are unknown.

Definition 5 Given a finite sample $\mathcal{D} = (x_1, \ldots, x_l) \in \mathcal{X}^l$ with labels
$\mathcal{L} = \sigma^*\mathcal{D} = (y_1, \ldots, y_l) \in \mathcal{Y}^l$, the empirical risk of $f : \mathcal{X} \to \mathcal{Y}$,

$$R(f, \mathcal{D}, \mathcal{L}) = \frac{1}{l} \sum_{i=1}^{l} \mathbb{1}[f(x_i) \neq y_i], \tag{5}$$

is the fraction of the data $\mathcal{D}$ on which $f$ and $\sigma^*$ disagree.

The empirical risk provides a computable approximation to the (true) risk. 
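
The difference between Eqs. (4) and (5) can be seen on a toy example. The sketch below is a hypothetical setup, not from the paper: six inputs with a uniform distribution, a made-up supervisor, and a candidate predictor that disagrees with it on a single input. In practice only `empirical_risk` is computable, since `risk` needs both $P$ and $\sigma^*$.

```python
def risk(f, supervisor, p):
    """True risk: probability under p that f and the supervisor disagree."""
    return sum(prob for x, prob in p.items() if f(x) != supervisor(x))

def empirical_risk(f, sample, labels):
    """Fraction of the labeled sample on which f disagrees with the labels."""
    errors = sum(1 for x, y in zip(sample, labels) if f(x) != y)
    return errors / len(sample)

# Hypothetical toy universe: X = {0, ..., 5}, uniform P.
def supervisor(x):          # plays the role of sigma*
    return 1 if x >= 3 else -1

def f(x):                   # candidate predictor, wrong only at x = 2
    return 1 if x >= 2 else -1

p = {x: 1 / 6 for x in range(6)}
sample = [0, 2, 5, 2]                       # x = 2 happens to be drawn twice
labels = [supervisor(x) for x in sample]

true_risk = risk(f, supervisor, p)
emp_risk = empirical_risk(f, sample, labels)
```

Here the empirical risk (1/2) overstates the true risk (1/6) because the sample over-represents the one input on which the predictor errs.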

Remark 6 Note that in this paper, sets $\mathcal{X}$ and $\mathcal{Y}$ are both finite. Similarly, the training data $\mathcal{D} \in \mathcal{X}^l$
and labels $\mathcal{L} \in \mathcal{Y}^l$ also live in finite sets.

3 Statistical learning theory 

Suppose we wish to predict the unknown supervisor $\sigma^*$ based on its behavior on labeled data $(\mathcal{D}, \mathcal{L})$.
A simple way to find a mechanism in repertoire $\mathcal{F} \subset \Sigma(\mathcal{X},\mathcal{Y})$ that approximates $\sigma^*$ well is to
minimize the empirical risk.

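Definition 7 Given repertoire $\mathcal{F} \subset \Sigma$ and unlabeled data $\mathcal{D} \in \mathcal{X}^l$, define learning algorithm

$$A_{\mathcal{F},\mathcal{D}} : \Sigma \to \mathcal{F} : \sigma \mapsto \arg\min_{f \in \mathcal{F}} R(f, \mathcal{D}, \sigma\mathcal{D}), \tag{6}$$

which finds the mechanism in $\mathcal{F}$ that minimizes empirical risk.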
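
For a finite repertoire, the learning algorithm of Definition 7 is an exhaustive search. The sketch below assumes functions are stored as tuples indexed by input; the threshold repertoire, the supervisor and the tie-breaking rule are all hypothetical choices made for illustration.

```python
def empirical_risk(f, sample, labels):
    """Fraction of sample points where f (a tuple indexed by input) errs."""
    return sum(1 for x, y in zip(sample, labels) if f[x] != y) / len(sample)

def erm(repertoire, sample, supervisor):
    """Sketch of A_{F,D}: label the sample with the supervisor, then return a
    repertoire element of minimal empirical risk. Ties are broken by order in
    `repertoire`, which is an arbitrary choice."""
    labels = [supervisor[x] for x in sample]
    return min(repertoire, key=lambda f: empirical_risk(f, sample, labels))

# Hypothetical setup: X = {0, 1, 2, 3}; repertoire of threshold classifiers.
X = range(4)
thresholds = [tuple(1 if x >= t else -1 for x in X) for t in range(5)]
supervisor = (-1, -1, 1, 1)
sample = (0, 1, 3)

best = erm(thresholds, sample, supervisor)
```

Two thresholds fit this sample perfectly, because input 2 was never observed; the sample alone cannot tell them apart.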

Learning algorithm $A_{\mathcal{F},\mathcal{D}}$ finds the mechanism in $\mathcal{F}$ that appears, based on the empirical risk, to
best approximate $\sigma^*$. Empirical risk stays constant or decreases as $\mathcal{F}$ is enlarged, suggesting that
the larger the repertoire the better.

This is not true in general since minimizing risk - and not empirical risk - is the goal. There is
a tradeoff: increasing the size of $\mathcal{F}$ leads to overfitting the data, which can increase risk even as
empirical risk is reduced.

The tendency of a repertoire to overfit data depends on its size or capacity. We recall two
measures of capacity that are used to bound risk: empirical VC-entropy [13] and empirical Rademacher
complexity [8].

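Definition 8 Given unlabeled data $\mathcal{D} \in \mathcal{X}^l$ and repertoire $\mathcal{F} \subset \Sigma$, let

$$q_{\mathcal{D}} : \mathcal{F} \to \mathbb{R}^l : f \mapsto (f(x_1), \ldots, f(x_l)). \tag{7}$$

The empirical VC-entropy¹ of $\mathcal{F}$ on $\mathcal{D}$ is $V(\mathcal{F}, \mathcal{D}) := \log_2 |q_{\mathcal{D}}(\mathcal{F})|$, where $|q_{\mathcal{D}}(\mathcal{F})|$ is the number
of distinct points in the image of $q_{\mathcal{D}}$.

The empirical Rademacher complexity of $\mathcal{F}$ on $\mathcal{D}$ is

$$\mathcal{R}(\mathcal{F}, \mathcal{D}) := \mathbb{E}_{\sigma} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{l} \sum_{i=1}^{l} \sigma(x_i) \cdot f(x_i) \Big], \tag{8}$$

where the expectation is over labelings $\sigma$ drawn uniformly from $\Sigma$.

VC-entropy "counts" how many labelings of $\mathcal{D}$ the classifiers in $\mathcal{F}$ fit perfectly. Rademacher
complexity is a weighted count of how many labelings of $\mathcal{D}$ functions in $\mathcal{F}$ fit well.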
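
Both capacity measures can be computed by brute force when the sample is small. The sketch below is a minimal illustration under stated assumptions: functions are tuples indexed by input, the sample points are distinct (so that averaging over sign patterns on the sample equals averaging over all labelings), and the threshold repertoire is a hypothetical example rather than anything from the paper.

```python
import math
from itertools import product

def restrict(f, sample):
    """The map q_D: evaluate f (a tuple indexed by input) on the sample."""
    return tuple(f[x] for x in sample)

def empirical_vc_entropy(repertoire, sample):
    """log2 of the number of distinct labelings of the sample realized by F."""
    return math.log2(len({restrict(f, sample) for f in repertoire}))

def empirical_rademacher(repertoire, sample):
    """Average, over all +/-1 labelings of the sample, of the best correlation
    achieved by the repertoire. Assumes the sample points are distinct."""
    l = len(sample)
    patterns = {restrict(f, sample) for f in repertoire}
    total = 0.0
    for signs in product((-1, 1), repeat=l):
        total += max(sum(s * v for s, v in zip(signs, pat)) for pat in patterns) / l
    return total / 2 ** l

# Hypothetical setup: X = {0, 1, 2, 3}, threshold classifiers, sample of size 3.
X = range(4)
thresholds = [tuple(1 if x >= t else -1 for x in X) for t in range(5)]
everything = list(product((-1, 1), repeat=4))    # the full hypothesis space
sample = (0, 1, 3)

vc_thresholds = empirical_vc_entropy(thresholds, sample)
rad_thresholds = empirical_rademacher(thresholds, sample)
rad_everything = empirical_rademacher(everything, sample)
```

The repertoire containing every function reaches the maximal values (VC-entropy $l$ and Rademacher complexity 1), while the threshold repertoire realizes only 4 of the 8 labelings of this sample.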

The following theorems are shown in [3] and [4] respectively: 

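Theorem 9 (empirical VC-entropy bound)

With probability $1 - \delta$, the expected risk is bounded by

$$R(f) \le R(f, \mathcal{D}, \mathcal{L}) + c_1 \sqrt{\frac{V(\mathcal{F}, \mathcal{D})}{l}} + c_2 \cdot (\text{confidence term in } \delta \text{ and } l) \tag{9}$$

for all $f \in \mathcal{F}$, where $c_1$ and $c_2$ are the constants given in [3].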



¹ VC-entropy is the expectation of empirical VC-entropy [14]. Also, note the standard definition of
VC-entropy uses $\log_e$ rather than $\log_2$.
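
Theorem 10 (empirical Rademacher bound)

For all $\delta > 0$, with probability at least $1 - \delta$,

$$R(f) \le R(f, \mathcal{D}, \mathcal{L}) + \mathcal{R}(\mathcal{F}, \mathcal{D}) + c_3 \cdot (\text{confidence term in } \delta \text{ and } l) \tag{10}$$

for all $f \in \mathcal{F}$, where $c_3$ is the constant given in [4].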

The tradeoff between empirical risk and capacity is visible in the first two terms on the right-hand 
sides of the bounds. 

The left-hand sides of Eqs (9) and (10) cannot be computed since $P$ and $\sigma^*$ are unknown.
Remarkably, the right-hand sides depend only on mechanism $f$ chosen from repertoire $\mathcal{F}$, labeled data
$(\mathcal{D}, \mathcal{L})$ and desired confidence $\delta$. The theorems assume data is drawn i.i.d. according to $P$ and
labeled according to $\sigma^*$; they make no assumptions about the distribution $P$ on $\mathcal{X}$ or supervisor $\sigma^*$,
except that they are fixed.

4 Falsification 

This section reformulates the results from statistical learning theory to show how the past falsifica- 
tions performed by a learning algorithm control future performance. We show that the empirical 
VC-entropies and Rademacher complexities admit interpretations as "counting" (in senses made 
precise below) the number of hypotheses falsified by a particular measurement performed when 
learning. 

We start by introducing a special mechanism, the min-risk, which is used implicitly in learning
algorithm $A_{\mathcal{F},\mathcal{D}}$. As we will see, the structure of the measurements performed by the min-risk
determines the capacity of the learning algorithm.

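Definition 11 Given repertoire $\mathcal{F} \subset \Sigma$ and unlabeled data $\mathcal{D} \in \mathcal{X}^l$, define the min-risk as the
minimum of the empirical risk on $\mathcal{F}$:

$$R_{\mathcal{F},\mathcal{D}} : \Sigma \to \mathbb{R} : \sigma \mapsto \min_{f \in \mathcal{F}} R(f, \mathcal{D}, \sigma\mathcal{D}). \tag{11}$$

The min-risk is a mechanism mapping supervisors $\sigma$ in $\Sigma$ to the empirical risk of their best
approximations $A_{\mathcal{F},\mathcal{D}}(\sigma)$ in $\mathcal{F}$, see Fig. 2. Note that inputs to the min-risk are themselves mechanisms.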
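
Because its inputs are whole hypotheses, the min-risk can be tabulated over a small hypothesis space. The sketch below enumerates all of $\Sigma$ for a four-element input set and groups hypotheses by the value the min-risk assigns them. The setup (threshold repertoire, sample, tuple encoding of functions) is hypothetical and chosen only to keep the enumeration tiny.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def min_risk(repertoire, sample, sigma):
    """Minimum over the repertoire of the empirical risk against labels sigma(D)."""
    l = len(sample)
    return min(Fraction(sum(f[x] != sigma[x] for x in sample), l)
               for f in repertoire)

# Hypothetical setup: X = {0, 1, 2, 3}, so Sigma has 2**4 = 16 hypotheses.
X = range(4)
hypotheses = list(product((-1, 1), repeat=4))
thresholds = [tuple(1 if x >= t else -1 for x in X) for t in range(5)]
sample = (0, 1, 3)

# Each output value of the min-risk picks out a category (pre-image) of hypotheses.
categories = Counter(min_risk(thresholds, sample, s) for s in hypotheses)
```

In this example the min-risk splits the 16 hypotheses into two categories of 8: those the threshold repertoire fits perfectly on the sample, and those it misfits on one of the three sample points.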

We suggestively interpret the setup as follows. Suppose a scientist studies a universe where inputs
in $\mathcal{X}$ appear according to distribution $P$, and are assigned labels in $\mathcal{Y}$ by unknown physical process
$\sigma^*$. The hypothesis space is $\Sigma(\mathcal{X}, \mathcal{Y})$, the set of all possible (deterministic) physical processes that
take $\mathcal{X}$ to $\mathcal{Y}$.

The scientist's goal is to learn to predict physical process $\sigma^*$, on the basis of a small sample of
labeled data $(\mathcal{D}, \mathcal{L})$. She has a theory, repertoire $\mathcal{F}$, and a method, $A_{\mathcal{F},\mathcal{D}}$, which she uses to fit some
particular $f \in \mathcal{F}$ given $\mathcal{L}$.

The most important question for the scientist is: How reliable are predictions made by $f$ on new
data? We will show that $f$'s reliability depends on the measurements performed by the min-risk -
i.e. on the work done by the scientist when she applies method $A_{\mathcal{F},\mathcal{D}}$ to find $f$.

4.1 Empirical VC entropy 

Empirical VC-entropy is, essentially, the effective information generated by the min-risk when it 
outputs a perfect fit: 

Proposition 12 (VC-entropy via effective information) 

Empirical VC entropy is 

$$V(\mathcal{F}, \mathcal{D}) = l - ei(R_{\mathcal{F},\mathcal{D}}, 0). \tag{12}$$

[Figure 2 panel labels: hypothesis space; min-risk; hypotheses fit perfectly by $\mathcal{F}$; hypotheses fit worst by $\mathcal{F}$.]

Figure 2: The structure of the measurement performed by the min-risk. The min-risk
categorizes potential hypotheses in $\Sigma$ according to how well they are fit by mechanisms in theory $\mathcal{F}$.



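Proof: Let $\mathcal{X} = \mathcal{D} \cup \mathcal{D}^c$, where $\mathcal{D}^c$ is the set of inputs outside the sample, and $|\mathcal{X}| = m$. Then
$\Sigma = \{\sigma : \mathcal{D} \to \mathcal{Y}\} \times \{\sigma : \mathcal{D}^c \to \mathcal{Y}\}$. By definition

$$ei(R_{\mathcal{F},\mathcal{D}}, 0) = \log_2 |\Sigma| - \log_2 |R^{-1}_{\mathcal{F},\mathcal{D}}(0)|,$$

with $\log_2 |\Sigma| = m$. It remains to show that $|R^{-1}_{\mathcal{F},\mathcal{D}}(0)| = 2^{m-l} \cdot |q_{\mathcal{D}}(\mathcal{F})|$. Points in the image of
$q_{\mathcal{D}}$ correspond to labelings $\sigma$ of the data by functions in $\mathcal{F}$. Thus, $|q_{\mathcal{D}}(\mathcal{F})|$ counts distinct labelings
of $\mathcal{D}$ that $\mathcal{F}$ fits perfectly. These occur with multiplicity $2^{m-l}$ in the pre-image by the product
decomposition of $\Sigma$ above. Hence $ei(R_{\mathcal{F},\mathcal{D}}, 0) = m - (m - l) - V(\mathcal{F}, \mathcal{D}) = l - V(\mathcal{F}, \mathcal{D})$. ■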
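
The counting argument in the proof can be checked numerically on a small case. The sketch below is an illustration under assumptions introduced here (a four-element input set, a sample of three distinct points, a threshold repertoire); it compares the two sides of Eq. (12) and the multiplicity $2^{m-l}$ used in the proof.

```python
import math
from itertools import product

def min_risk(repertoire, sample, sigma):
    """Minimum empirical risk of the repertoire against labels sigma(D)."""
    return min(sum(f[x] != sigma[x] for x in sample) for f in repertoire) / len(sample)

# Hypothetical setup: m = |X| = 4 inputs, l = 3 distinct sample points.
m, sample = 4, (0, 1, 3)
l = len(sample)
hypotheses = list(product((-1, 1), repeat=m))
thresholds = [tuple(1 if x >= t else -1 for x in range(m)) for t in range(m + 1)]

# Pre-image of a perfect fit: hypotheses the repertoire fits with zero risk.
preimage_zero = [s for s in hypotheses if min_risk(thresholds, sample, s) == 0]
ei_zero = math.log2(len(hypotheses)) - math.log2(len(preimage_zero))

# Number of distinct labelings of the sample realized by the repertoire.
realized = {tuple(f[x] for x in sample) for f in thresholds}
vc_entropy = math.log2(len(realized))
```

With these numbers the pre-image holds $2^{4-3} \cdot 4 = 8$ hypotheses, so the effective information is 1 bit and $l - ei = 2$ equals the empirical VC-entropy.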

We interpret the result as follows. Suppose the scientist applies theory $\mathcal{F}$ to explain her labeled data
and perfectly fits function $f = A_{\mathcal{F},\mathcal{D}}(\sigma^*)$ with risk $\epsilon = 0$.

By Definition 3, the meaning of her work is $\Sigma \supset R^{-1}_{\mathcal{F},\mathcal{D}}(0)$: the set of mechanisms that her theory
$\mathcal{F}$ fits perfectly. The effective information generated by her work is

$$\begin{aligned}
ei(R_{\mathcal{F},\mathcal{D}}, 0) &= \log_2 |\Sigma| - \log_2 |R^{-1}_{\mathcal{F},\mathcal{D}}(0)| \\
&= \{\text{total no. of hypotheses}\} - \{\text{no. that theory fits}\} \\
&= \{\text{no. of hypotheses falsified}\},
\end{aligned} \tag{13}$$

where hypotheses are counted in bits (after taking logarithms). A theory is informative if it rules out
many potential hypotheses [11].

The number of hypotheses the scientist falsifies when using theory $\mathcal{F}$ to fit $f$ has implications for its
future performance:

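Corollary 13 (information-theoretic empirical VC bound)

With probability $1 - \delta$, the risk of predictor $f = A_{\mathcal{F},\mathcal{D}}(\sigma^*)$ outputted by learning algorithm $A_{\mathcal{F}}$ is
bounded by

$$R(f) \le R(f, \mathcal{D}, \mathcal{L}) + c_1 \sqrt{1 - \frac{ei(R_{\mathcal{F},\mathcal{D}}, 0)}{l}} + c_2 \cdot (\text{confidence term in } \delta \text{ and } l). \tag{14}$$

Proof: By Theorem 9 and Proposition 12. ■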

The corollary states that minimizing empirical risk embeds expectations about the future into
predictors. So long as the corollary's assumptions hold, future performance by $f$ is controlled by: (i)
the output of the min-risk, i.e. the fraction $\epsilon$ of the data that $f$ fails to fit; (ii) the effective information
generated by the min-risk, i.e. the number (in bits) of hypotheses the learning algorithm falsifies if it
fits perfectly; and (iii) a confidence term. The only assumption made by the corollary is that $P$ and
$\sigma^*$ are fixed.

Remark 14 The theorem provides no guarantees on the future performance of a theory that
"explains everything", i.e. $\mathcal{F} = \Sigma$, no matter how well it fits the data. This follows since effective
information is zero when $\mathcal{F} = \Sigma$, and so the second term on the right-hand side of Eq. (14) is
$c_1 \approx 2$.



Reformulating the above result in terms of code lengths suggests a connection between VC-theory
and minimum message length (MML), see [16] and §6.6 of [6]. Recall that, given probability
distribution $p(X)$, the message length of event $x$ in an optimal binary code is $\mathrm{len}(x) := -\log_2 p(x)$.

Corollary 15 (VC-entropy controls code length of true hypothesis)

Denote the min-risk by $m = R_{\mathcal{F},\mathcal{D}}$. The length of the true hypothesis $\sigma$ in the optimal code for the
actual repertoire specified by the min-risk, $p_m(\Sigma|\epsilon = 0)$, is

$$\mathrm{len}(\sigma) = V(\mathcal{F}, \mathcal{D}) + (|\mathcal{X}| - |\mathcal{D}|).$$

Proof: By Proposition 12 we have $-\log_2 p_m(\sigma|\epsilon = 0) = \log_2 |R^{-1}_{\mathcal{F},\mathcal{D}}(0)|$. ■

The length of the message describing the true hypothesis in the actual repertoire's optimal code is
the empirical VC-entropy plus a term, $(|\mathcal{X}| - |\mathcal{D}|) = (m - l)$, that decreases as the amount of
training data increases. The shorter the message, the better the predictor's expected performance
(for fixed empirical risk).
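
The code-length statement can also be checked by constructing the actual repertoire explicitly. The sketch below assumes a deterministic mechanism, so the actual repertoire is uniform on the pre-image, and reuses a hypothetical threshold repertoire; the "true hypothesis" is an arbitrary choice among the perfectly fitted ones.

```python
import math
from itertools import product

def actual_repertoire(mechanism, inputs, output):
    """Uniform distribution on the pre-image of `output` (deterministic case)."""
    preimage = [x for x in inputs if mechanism(x) == output]
    return {x: 1 / len(preimage) for x in preimage}

# Hypothetical setup: m = 4 inputs, sample of l = 3 distinct points.
m, sample = 4, (0, 1, 3)
l = len(sample)
hypotheses = list(product((-1, 1), repeat=m))
thresholds = [tuple(1 if x >= t else -1 for x in range(m)) for t in range(m + 1)]

def min_errors(sigma):
    """Min-risk numerator: fewest sample errors any threshold makes on sigma."""
    return min(sum(f[x] != sigma[x] for x in sample) for f in thresholds)

posterior = actual_repertoire(min_errors, hypotheses, 0)
true_hypothesis = (-1, -1, 1, 1)          # assumed supervisor, fitted perfectly
code_length = -math.log2(posterior[true_hypothesis])

vc_entropy = math.log2(len({tuple(f[x] for x in sample) for f in thresholds}))
```

Here the code length is 3 bits: 2 bits of empirical VC-entropy plus 1 bit for the single unobserved input.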

4.2 Empirical Rademacher complexity 

VC-entropy only considers hypotheses that theory $\mathcal{F}$ fits perfectly. Rademacher complexity is an
alternate capacity measure that considers the distribution of risk across the entire hypothesis space.
This section explains Rademacher complexity via an analogy with Solomonoff probability [12, 17].

We first recall Solomonoff's definition. Given universal Turing machine $T$, define (unnormalized)
Solomonoff probability

$$p_T(s) := \sum_{\{i \,|\, T(i) = s*\}} 2^{-\mathrm{len}(i)}, \tag{15}$$

where the sum is over strings² $i$ that cause $T$ to output $s$ as a prefix, and $\mathrm{len}(i)$ is the length of $i$. We
adapt Eq. (15) by replacing Turing machine $T$ with min-risk $R_{\mathcal{F},\mathcal{D}} : \Sigma \to \mathbb{R}$.

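Definition 16 Equipping hypothesis space with the uniform distribution $p_{unif}(\Sigma)$, all hypotheses
have length $\mathrm{len}(\sigma) = |\mathcal{X}| = \log_2 |\Sigma|$ in the optimal code. Set the Rademacher distribution for the
min-risk $m = R_{\mathcal{F},\mathcal{D}}$ as

$$p_m(\epsilon) := \begin{cases} \sum_{\{\sigma \,|\, R_{\mathcal{F},\mathcal{D}}(\sigma) = \epsilon\}} 2^{-\mathrm{len}(\sigma)} & \text{if } \epsilon \in R_{\mathcal{F},\mathcal{D}}(\Sigma) \\ 0 & \text{else.} \end{cases}$$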

The Rademacher distribution is constructed following Solomonoff's approach after substituting the
min-risk as a "special-purpose Turing machine" that only accepts hypotheses in finite set $\Sigma$ as inputs.
It tracks the fraction of hypotheses in $\Sigma$ that yield risk $\epsilon$.

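The Rademacher distribution arises naturally as the denominator when using Bayes' rule to compute
the actual repertoire $p_m(\Sigma|\epsilon)$:

$$p_m(\sigma|\epsilon) = \frac{p_m(\epsilon|\sigma)}{p_m(\epsilon)} \cdot p_{unif}(\sigma), \quad \text{where } p_m(\epsilon|\sigma) = \begin{cases} 1 & \text{if } R_{\mathcal{F},\mathcal{D}}(\sigma) = \epsilon \\ 0 & \text{else.} \end{cases}$$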



Proposition 17 (Rademacher complexity via min-risk)

$$\mathcal{R}(\mathcal{F}, \mathcal{D}) = 1 - 2 \cdot \mathbb{E}[\epsilon \,|\, p_m(\epsilon)]. \tag{17}$$



² A technical point is that no proper prefix of $i$ should output $s$.
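
Proof: We refer to $\mathbb{E}[\epsilon \,|\, p_m(\epsilon)]$ as the expected min-risk. From Eq. (8),

$$\mathcal{R}(\mathcal{F}, \mathcal{D}) = \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \sup_{f \in \mathcal{F}} \frac{1}{l} \sum_{i=1}^{l} \sigma(x_i) \cdot f(x_i).$$

Observe that $\frac{1}{l} \sum_{i=1}^{l} \sigma(x_i) \cdot f(x_i) = 1 - 2R(f, \mathcal{D}, \sigma\mathcal{D})$. It follows that
$\sup_{f \in \mathcal{F}} \frac{1}{l} \sum_{i=1}^{l} \sigma(x_i) \cdot f(x_i) = 1 - 2R_{\mathcal{F},\mathcal{D}}(\sigma)$, which implies

$$\mathcal{R}(\mathcal{F}, \mathcal{D}) = \frac{1}{|\Sigma|} \sum_{\sigma \in \Sigma} \big(1 - 2R_{\mathcal{F},\mathcal{D}}(\sigma)\big) = 1 - 2 \sum_{\epsilon} \epsilon \cdot p_m(\epsilon). \; ■$$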
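
Proposition 17 can be verified exactly on a small hypothesis space using rational arithmetic. The sketch below is illustrative: the four-element input set, the sample and the threshold repertoire are assumptions made here. It builds the Rademacher distribution of Definition 16 by counting, then compares $1 - 2\,\mathbb{E}[\epsilon \,|\, p_m(\epsilon)]$ with the Rademacher complexity computed directly as an average of best correlations.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical setup: X = {0, 1, 2, 3}, sample of three distinct points.
m, sample = 4, (0, 1, 3)
l = len(sample)
hypotheses = list(product((-1, 1), repeat=m))
thresholds = [tuple(1 if x >= t else -1 for x in range(m)) for t in range(m + 1)]

def min_risk(sigma):
    """Min over the repertoire of the fraction of sample points misfit."""
    return min(Fraction(sum(f[x] != sigma[x] for x in sample), l) for f in thresholds)

# Rademacher distribution: fraction of hypotheses yielding each min-risk value.
counts = Counter(min_risk(s) for s in hypotheses)
rademacher_dist = {eps: Fraction(c, len(hypotheses)) for eps, c in counts.items()}
expected_min_risk = sum(eps * p for eps, p in rademacher_dist.items())

def correlation(f, sigma):
    """(1/l) * sum_i sigma(x_i) * f(x_i) on the sample."""
    return Fraction(sum(sigma[x] * f[x] for x in sample), l)

# Direct computation: average over all hypotheses of the best correlation.
rademacher = sum(max(correlation(f, s) for f in thresholds)
                 for s in hypotheses) / len(hypotheses)
```

In this example the expected min-risk is 1/6 and both routes give a Rademacher complexity of 2/3.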



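Rademacher complexity is low if the expected min-risk is high. The expected min-risk admits an
interesting interpretation. For any hypothesis $\sigma \in R^{-1}_{\mathcal{F},\mathcal{D}}(\epsilon)$ the classifier $f_\sigma := A_{\mathcal{F},\mathcal{D}}(\sigma) \in \mathcal{F}$
outputted by the learning algorithm yields incorrect answers on fraction
$\epsilon = \frac{1}{l} \sum_{i=1}^{l} \mathbb{1}[f_\sigma(x_i) \neq \sigma(x_i)]$ of the data. It follows that

$$\mathbb{E}[\epsilon \,|\, p_m(\epsilon)] = \sum_{\epsilon} p_m(\epsilon) \cdot \epsilon = \sum_{\epsilon} \{\text{fraction of hypotheses falsified}\} \cdot \{\text{on fraction } \epsilon \text{ of the data}\}.$$

A bold theory is one for which $\mathbb{E}[\epsilon \,|\, p_m(\epsilon)]$ is high, meaning that its predictors (the classifiers it
tries to fit to data) are sufficiently narrow that it would falsify most hypotheses on most of the data.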

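When a bold theory happens to fit labeled data well, it is guaranteed to perform well in future:

Corollary 18 (information-theoretic empirical Rademacher bound)

With probability $1 - \delta$, the risk of predictor $f = A_{\mathcal{F}}(\mathcal{D}, \mathcal{L})$ outputted by learning machine $A_{\mathcal{F}}$ is
bounded by

$$R(f) \le R(f, \mathcal{D}, \mathcal{L}) + \Big(1 - 2 \sum_{\epsilon} \epsilon \cdot 2^{-ei(R_{\mathcal{F},\mathcal{D}}, \epsilon)}\Big) + c_3 \cdot (\text{confidence term in } \delta \text{ and } l). \tag{18}$$

Proof: By Proposition 17 and definition of effective information we have

$$\mathbb{E}[\epsilon \,|\, p_m(\epsilon)] = \sum_{\epsilon} \epsilon \cdot 2^{-ei(R_{\mathcal{F},\mathcal{D}}, \epsilon)}.$$

The result follows by Theorem 10. ■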

Rademacher complexity is low if the min-risk's sharp measurements (high $ei$) are accurate (low $\epsilon$),
and conversely. Analogously to Corollary 13, the Rademacher bound implies the future performance
of a classifier depends on: (i) the fraction $\epsilon$ of the data that $f$ fails to fit; (ii) the weighted (by the fraction
$\epsilon$ of data that falsifies them) sum of the fraction of hypotheses falsified; and (iii) a confidence term.
Once again, the only assumption is that $P$ and $\sigma^*$ are fixed.



5 Discussion 



Learning according to algorithm $A_{\mathcal{F},\mathcal{D}}$ entails computing the min-risk, which classifies hypotheses
about $\mathcal{D}$ according to how well they are approximated by predictors in repertoire $\mathcal{F}$. Repertoires that
rule out many hypotheses when they fit labeled data $(\mathcal{D}, \mathcal{L})$ generate more effective information than
repertoires that "approximate everything". As a consequence, when and if an informative repertoire
fits labeled data well, Corollary 13 implies we can be confident in future predictions on unseen data.

A pleasing consequence of reformulating empirical VC-entropy and empirical Rademacher
complexity in terms of falsifying hypotheses is that it directly connects Popper's intuition about
falsifiable theories to statistical learning theory, thereby providing a rigorous justification for the former.

Our motivation for reformulating learning theory information-theoretically arises from a desire to
better understand the role of information in biology. Although Shannon information has been heavily
and successfully applied to biological questions, it has been argued that it does not fully capture
what biologists mean by information since it is not semantic. For example, Maynard Smith states
that "In biology, the statement that A carries information about B implies that A has the form it does
because it carries that information" [10]. Shannon information was invented to study communication
across prespecified channels, and lacks any semantic content. Maynard Smith therefore argues that 
a different notion of information is needed to understand in what sense evolution and development 
embed information into an organism. 

It may be fruitful to apply statistical learning theory to models of development. One possible
approach is to consider analogs of repertoire $\mathcal{F}$. For example, $\mathcal{F}$ may correspond to the repertoire of
possible adult forms a zygote could develop into. The particular adult form chosen, $f \in \mathcal{F}$, depends
on the historical interactions $(\mathcal{D}, \mathcal{L})$ between the organism and its environment, assuming these can
be suitably formalized. The information generated by the organism's development would then have
implications for its future interactions with its environment. More speculatively, a similar tactic
could be applied to quantify the information embedded in populations by inheritance and natural
selection.